mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter) #16574
Conversation
ngxson left a comment:
> add a minimal validation tool llama-jinaclip-cli (built by default) for text/image embedding numerical/performance checks;
I don't see why we need to add this new CLI. The mtmd-cli can do this with the -p and --image params.
convert_hf_to_gguf.py (outdated):

```python
# Top-level direct mappings
if src_no_vm == 'cls_token':
    return [('v.cls_token', data_torch)]
```
Use proper mapping instead
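For illustration only, here is a minimal sketch of what table-driven mapping looks like, as opposed to hardcoded per-tensor branches; the actual converter should go through gguf-py's tensor-name mapping machinery rather than a hand-rolled dict like this one (the table contents below are assumptions, not the real mapping):

```python
# Hypothetical table-driven tensor-name mapping (illustrative sketch only).
# In the real converter this should use gguf-py's mapping helpers instead of
# per-tensor if/return branches in modify_tensors().
VISION_TENSOR_MAP = {
    "cls_token": "v.cls_token",  # example entry taken from the diff above
}

def map_tensor_name(src_name: str) -> str:
    # Look up the GGUF-side name for a source tensor; fail loudly on gaps
    # so unmapped tensors are caught at conversion time.
    try:
        return VISION_TENSOR_MAP[src_name]
    except KeyError:
        raise ValueError(f"unmapped tensor: {src_name}")
```

The point of the table is that adding a tensor becomes a one-line data change rather than another code branch.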
tools/mtmd/clip.cpp (outdated):

```cpp
if (!ctx->jinaclip_rope_initialized) {
    const int half_dim = rope_dim / 2;
    std::vector<float> base_freqs(half_dim);
    for (int i = 0; i < half_dim; i++) {
        float arange_val = i * 2.0f;                        // [0, 2, 4, ..., 30]
        float normalized = arange_val / rope_dim;           // [0, 2/32, 4/32, ..., 30/32]
        float theta_powered = powf(freq_base, normalized);  // theta^normalized
        base_freqs[i] = 1.0f / theta_powered;               // 1.0 / theta^normalized
    }
```
Not sure what you're trying to do here; is this just 2D RoPE (which we already support)?
This isn't re-implementing generic 2D RoPE; it implements JinaCLIP's VisionRotaryEmbeddingFast.
It uses fractional-position 2D RoPE (t = arange(ft)/ft * pt) and precomputes a full H×W cos/sin grid; the official 2D RoPE uses integer grid positions (pos_h/pos_w) with ggml_rope_ext and does not include these steps.
This is done to strictly match Jina's Python semantics.
> fractional-position 2D RoPE (t = arange(ft)/ft * pt)

Based on your code:

```cpp
time_seq[i] = (float) i / ft_seq_len * pt_seq_len; // [0, 16/36, 32/36, ..., 560/36]
...
freqs_h[t * half_dim + f] = time_seq[t] * base_freqs[f];
```

Then why don't we scale base_freqs[f] instead? The third param of ggml_rope_ext, the c tensor (freq_scale), is made for this purpose.
Honestly I think this is just YaRN
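For what it's worth, the fractional-position construction quoted above is algebraically equivalent to pre-scaling the base frequencies: (i/ft·pt)·f = i·(f·pt/ft), which is exactly what a frequency-scale factor expresses. A standalone numeric check, using the grid sizes implied by the code comments (ft_seq_len = 36, pt_seq_len = 16, rope_dim = 32; these values are illustrative assumptions):

```python
import math

# Check that fractional positions t_i = i / ft_seq_len * pt_seq_len applied to
# base_freqs give the same rotation angles as integer positions i applied to
# base_freqs pre-scaled by pt_seq_len / ft_seq_len (a freq_scale-style factor).
rope_dim, freq_base = 32, 10000.0
ft_seq_len, pt_seq_len = 36, 16          # illustrative grid sizes
half_dim = rope_dim // 2
base_freqs = [1.0 / freq_base ** (2.0 * i / rope_dim) for i in range(half_dim)]

scale = pt_seq_len / ft_seq_len          # candidate freq_scale
for t in range(ft_seq_len):
    frac_pos = t / ft_seq_len * pt_seq_len       # Jina-style fractional position
    for f in range(half_dim):
        a = frac_pos * base_freqs[f]             # fractional-position angle
        b = t * (base_freqs[f] * scale)          # integer position + scaled freqs
        assert math.isclose(a, b, rel_tol=1e-12)
```

If this holds for the full construction, the custom precomputed grid could in principle be replaced by the existing rope path with an appropriate scale factor; whether the H×W cos/sin caching still matters for performance is a separate question.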
@pockers21 What's up?

I'm currently adjusting the code and fixing issues. I originally planned to answer your questions together when
@pockers21 You need to address the tensor mappings, as pointed out by @ngxson, use
Remove unnecessary try/except Jina text hparams. Co-authored-by: Sigbjørn Skjæret <[email protected]>
Done, please review again.
Hmmm, there's a major issue with conversion. I tried kludging it by copying in values, but I got several other failures, so it's just not working...
The original Jina model is a single multi-modal checkpoint that contains both text and vision components, and the text side includes a LoRA head. In our workflow, we did two things:

If you want to run conversion, you should follow the layout used here. Concretely, our implementation assumes that you:

Here, ORIG_IMAGE_PATH must point to the split_jina/image directory.

Looking forward to your feedback.
TBH, I'm not sure this is acceptable. I would expect to be able to convert the original model; granted, it's a little tricky due to the way it's constructed, but it should be doable. It might be acceptable to have a preprocessing script for it, but that's not ideal. @ngxson any opinions?
Update Notes (2025-11-6)

- block_count / projection_dim / feed_forward_length / attention.head_count.
Reproduction
Minimal commands & data (CPU)
- GGUF metadata: `jina-bert-v3.pooling_type = MEAN/CLS/LAST`; `clip.projector_type = jinaclip2`; `clip.vision.rope_theta = 10000` (default).

```shell
# Text embedding: llama.cpp vs Python reference
CUDA_VISIBLE_DEVICES= ./build/bin/llama-embedding -m /path/jina-text-converted.gguf -p "hello world" --n-gpu-layers 0 --pooling mean --embd-normalize 2 --embd-output-format array
python3 <ref>/debug.py --mode text --input "hello world" --out-dir <dir> --fa off

# Image embedding: llama.cpp vs Python reference
CUDA_VISIBLE_DEVICES= ./build/bin/llama-mtmd-cli --mmproj /path/mmproj-jina-vision-converted.gguf --image /path/img.jpg --n-gpu-layers 0 --embd-normalize 2 --embd-output-format array
python3 <ref>/debug.py --mode image --input /path/img.jpg --out-dir <dir> --fa off
```
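The paired commands produce an embedding from the GGUF pipeline and one from the Python reference; a minimal parity check is cosine similarity between the two vectors. The sketch below is illustrative (how the dumped embeddings are loaded is an assumption; stand-in lists are used in place of real file loading):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Stand-ins for embeddings parsed from the two tools' array outputs;
# replace with actual file loading for a real check.
gguf_emb = [0.1, 0.2, 0.3]
ref_emb  = [0.1, 0.2, 0.3]
assert cosine(gguf_emb, ref_emb) > 0.99
```

A threshold around 0.99 is a common sanity bar for numerical parity; exact tolerances depend on dtype and backend.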
Overview

- Output embeddings use `common_embd_normalize(..., 2)`.
- Add a minimal validation tool `llama-jinaclip-cli` (built by default) for text/image embedding numerical/performance checks; depends only on common+mtmd+Threads, cross-platform buildable, no third-party deps.

Scope of changes
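As a sketch of what `--embd-normalize 2` / `common_embd_normalize(..., 2)` computes, assuming the `2` denotes the p=2 (Euclidean) norm:

```python
import math

def l2_normalize(v, eps=1e-12):
    # Scale a vector to unit L2 norm; eps guards against division by zero.
    n = math.sqrt(sum(x * x for x in v))
    return [x / max(n, eps) for x in v]

v = l2_normalize([3.0, 4.0])
# norm of [3, 4] is 5 -> [0.6, 0.8]
```

Normalizing both the GGUF and reference embeddings the same way makes their cosine/dot comparisons directly comparable.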
- Add `clip.projector_type = jinaclip`, `clip.vision.rope_theta` (configurable), image_size/patch_size/projection_dim, and map tensors for fused/non-fused QKV.
- `clip_n_output_tokens()` returns 1 for JinaCLIP; `clip_n_mmproj_embd()` returns projection_dim.
- Add the `llama-jinaclip-cli` target (default); one command covers text/image minimal validation, thread scaling, encode_ms reporting, and saves embeddings for Python parity.

Validation summary
- `ci/run.sh` passes locally; no ggml op changes in this PR.
- `encode_ms` and thread scaling checked; no regression observed. More data can be added if requested.

Performance (absolute metrics, CPU-only minimal samples)
GPU group (absolute metrics, minimal samples)